1. Introduction

Professor Jayanta Kumar Ghosh, or J. K. Ghosh, as he is commonly known, 
has been a prominent contributor to the discipline of statistics for five decades.
The spectrum of his contributions encompasses sequential analysis, the founda- 
tions of statistics, finite populations, Edgeworth expansions, second order efficiency, 
Bartlett corrections, noninformative, and especially matching, priors, semiparamet-
ric inference, posterior limit theorems, Bayesian nonparametrics, model selection,
Bayesian hypothesis testing and high dimensional data analysis, as well as some
applied work in reliability theory, statistical quality control, modeling hydrocarbon 
discoveries, geological mapping and DNA fingerprinting. By itself, covering such 
diverse topics in depth is a major career achievement. He has authored over 130 
publications including three monographs and several edited volumes. His books, 
one entitled Higher Order Asymptotics and published as an IMS monograph and 
another entitled Bayesian Nonparametrics, co-authored by R. V. Ramamoorthi and 
published by Springer-Verlag, continue to hold respected positions for researchers
in these areas. His recently published third book [,'>-l] is a fine graduate text on 
Bayesian inference. In addition, his service to the profession, especially as the edi- 
tor of Sankhya, has been invaluable. 

The variety of his work notwithstanding, asymptotics has been central to his 
thinking across a wide range of problems. Accordingly, in what follows, we outline 
some of his work, in roughly chronological order, focussing on those contributions 
which are intimately connected to asymptotics. In the course of reviewing his work, 
we try to characterize the progression of thinking that naturally connects the topics 
that J. K. Ghosh has done so much to develop. 

2. Sequential analysis 

J. K. Ghosh started his research career in Sequential Analysis in the early sixties as a 
Graduate Student in the Department of Statistics at Calcutta University. Wald had 
recently introduced his sequential probability ratio test (SPRT), but its properties 
were not well understood in the composite case. This was the first topic to which 
Ghosh turned his attention. Through his work, many of the properties of SPRT 
and related procedures were established and better understood. For instance, in the 
testing context, double minimaxity essentially means simultaneous minimization of 
average type I and II error probabilities. In his first published work [26], Ghosh
clarified a result of Wald on the double minimaxity of the SPRT for a normal two-
sided alternative hypothesis (with unknown scale) separated from the null by δ.
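
For readers unfamiliar with the SPRT, the following is a minimal sketch of Wald's test in its simplest form: a simple null against a simple alternative for a normal mean with known variance. This is deliberately not the composite, unknown-scale setting that Ghosh studied; the function name sprt_normal, its arguments, and the use of Wald's approximate boundaries are assumptions made purely for illustration.

```python
import math

def sprt_normal(xs, mu0, mu1, sigma, alpha, beta):
    """Wald's SPRT for H0: mean = mu0 vs H1: mean = mu1 (known sigma).

    Returns (decision, n_used); decision is "H0", "H1", or "undecided"
    when the data run out before either boundary is crossed.
    """
    upper = math.log((1 - beta) / alpha)   # Wald's approximate upper boundary
    lower = math.log(beta / (1 - alpha))   # Wald's approximate lower boundary
    llr = 0.0                              # running log-likelihood ratio
    n = 0
    for n, x in enumerate(xs, start=1):
        # Increment log f1(x)/f0(x) for one normal observation.
        llr += (mu1 - mu0) * (x - (mu0 + mu1) / 2) / sigma**2
        if llr >= upper:
            return "H1", n
        if llr <= lower:
            return "H0", n
    return "undecided", n
```

Sampling stops as soon as the accumulated log-likelihood ratio leaves the interval between the two boundaries, which is what distinguishes the procedure from a fixed sample size test.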

It is well-known that the power function is monotonic in many common families 
for fixed sample sizes. Ghosh established an analog of this result in [27], namely 
that the operating characteristic function of the (generalized) SPRT continues to 
be monotonic. Also in the sequential context, [28] considered the admissibility of 
sequential tests based on a simple identity which later became known as the Ghosh- 
Pratt identity. Ghosh compared the SPRT not just with the class of all tests with 
finite expected sample size but also within other classes, for instance, the class which 
requires at least one observation or which requires no more than a predetermined 
number of observations to reach a conclusion. 

Following this, Ghosh continued to elucidate more properties of the SPRT, 
and its variants, which could be seen as analogs of the corresponding properties of
Neyman-Pearson or Bayes tests for fixed sample size. In [29], he proved that for
exponential families, truncated or untruncated Bayesian sequential decision rules'
terminal decisions describe regions in terms of sufficient statistics, and also showed 
that for testing problems, truncated generalized SPRT's form a complete class. 

About two decades later, Ghosh returned to sequential problems, along with 
various co-authors. In [-V-l], he studied an invariant SPRT to identify two normal 
populations with equal variance and obtained bounds for error probabilities. Most 
recently, similar bounds for an invariant SPRT with respect to an improper prior 
have also been obtained in [-^O]. 

Two-stage procedures are closely related to sequential procedures. Recall Stein's 
famous problem of finding a bounded length confidence interval for the normal mean 
with unknown variance. Stein proposed a two-stage procedure for doing this: In the 
first stage, the sample variance determines how many samples are to be taken in 
the second stage. An obvious shortcoming of the procedure is that the second stage 
sample variance is not used in the construction of the interval. So, it is natural to 
ask whether one can improve Stein's procedure by using the second stage sample 
variance. Surprisingly, it is impossible to better Stein's procedure, as shown in [38].
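
The two-stage rule described above can be sketched as follows. The function names and the convention that the caller supplies the Student t critical value (for instance from a table) are assumptions for illustration; the sample size formula is the standard form of Stein's rule, N = max{n0, floor(s^2 t^2 / d^2) + 1}, where d is the desired half-length.

```python
import math
import statistics

def stein_total_sample_size(first_stage, half_width, t_crit):
    """Total sample size N prescribed by Stein's two-stage rule.

    first_stage : first-stage observations (at least two of them)
    half_width  : desired half-length d of the confidence interval
    t_crit      : critical value of Student's t with n0 - 1 degrees of
                  freedom, supplied by the caller
    """
    n0 = len(first_stage)
    s2 = statistics.variance(first_stage)  # first-stage sample variance only
    return max(n0, math.floor(s2 * t_crit**2 / half_width**2) + 1)

def stein_interval(all_observations, half_width):
    """Fixed-length interval centered at the mean of all N observations."""
    center = statistics.mean(all_observations)
    return center - half_width, center + half_width
```

Note that only the first-stage variance enters the rule; the second-stage observations affect the center of the interval but not its length or the sample size.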

However, the procedure can be improved in a different, and perhaps more ap- 
propriate sense. The confidence coefficient does not in general properly reflect the 
true sense of confidence about a parameter after observing data. For instance, if 
two observations are obtained from a U(θ, θ + 1) family, then the assessment of θ is
very precise when the two observations differ in magnitude by nearly 1, while the 
assessment is much less precise if the two observations are close to each other. This 
means that classical confidence intervals fail to indicate the true difference in the 
level of confidence after observing the sample. 
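
The uniform example can be made concrete with a short sketch. Given two observations from U(θ, θ + 1), the parameter must satisfy max(x1, x2) - 1 ≤ θ ≤ min(x1, x2), so the set of values of θ compatible with the data has length 1 - |x1 - x2|. The function name below is hypothetical.

```python
def theta_range(x1, x2):
    """Values of theta compatible with two observations from U(theta, theta + 1).

    Both observations must lie in [theta, theta + 1], which forces
    max(x1, x2) - 1 <= theta <= min(x1, x2).
    """
    lower = max(x1, x2) - 1.0
    upper = min(x1, x2)
    return lower, upper

# Observations far apart pin theta down tightly ...
far_apart = theta_range(2.25, 3.0)
# ... while coincident observations leave a full unit of uncertainty.
coincident = theta_range(2.5, 2.5)
```

The first call leaves an interval of length 0.25 for θ and the second an interval of length 1, although a classical procedure would attach the same confidence coefficient in both cases.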

Motivated by this, Kiefer suggested letting the confidence coefficient depend on 
the data. After all, in reality, for a given random interval I, we often want to predict
the indicator function 1{θ ∈ I}. Since this object is unknown, it is traditionally
estimated by a constant, the best constant being the expectation P_θ(θ ∈ I), which
becomes fixed (or asymptotically fixed) for many classical intervals. However, from a
prediction theory point of view, it makes more sense to let the predictor of 1{θ ∈ I}
depend on the observed data. The predictor considered in this way is called the
random confidence coefficient associated with the confidence interval I. It is shown
in [39] that the second stage sample variance can be used to boost the random 
confidence coefficient of a bounded length confidence interval. 

3. Foundations of statistics 

From the examination of individual data points as they relate to the testing prob- 
lem, Ghosh shifted his attention to data summarization, focussing on the relation- 
ship between sufficiency and invariance. Sufficiency isolates features of the collection 
of observations from those of the individual ones which are independent of the fea- 
tures of the collection. Invariance, on the other hand, summarizes data by imposing 
symmetry constraints. In practice, both sufficiency and invariance restrictions are 
applied, but their order of application is an issue of interest. 

Consider a statistical model (X, V) where a group of transformations G is 
acting on the sample space and attention is limited to invariant procedures. To find 
a sufficiency reduction, one needs to find a sufficient sub-σ-field of the invariant
σ-field ℐ. However, in practice, it is typically easier to invoke invariance on the
data after it has been reduced by sufficiency. Let 𝒮 be a sufficient σ-field. To justify
the application of invariance restriction after a sufficiency reduction, it is enough
to establish that 𝒮 ∩ ℐ is sufficient for ℐ. This problem was addressed by W. J.
Hall, R. A. Wijsman and J. K. Ghosh, independently and roughly simultaneously. 
Once they realized they had compatible results, they published a combined paper 
[65]. Their main result can be described briefly as follows. A statistic T is called 
almost invariant if, for every g ∈ G, T(x) = T(gx) a.s. Under conditions that imply
that every almost invariant set is equivalent, up to null sets, to an invariant set, 
it follows that 𝒮 and ℐ are conditionally independent given 𝒮 ∩ ℐ, and hence
𝒮 ∩ ℐ is sufficient for ℐ.

Another notion which relates two sequences of σ-fields in sequential experiments
is that of transitivity, introduced by Bahadur. Two sequences of σ-fields ℬ_n ⊂ 𝒜_n
are said to be transitive if for every ℬ_{n+1}-measurable function f, E(f | 𝒜_n) is
ℬ_n-measurable. In the usual sequential setup, 𝒮_n ∩ ℐ_n is transitive for ℐ_n, where
the extra index n indicates the sample size. Several implications of this result were 
discussed in 

In many application areas, sample surveys for instance, discrete models arise, 
where the probability is concentrated on a countable set but the models do not 
have common support, i.e., the support set is different for different parameter val-
ues. Clearly, such a family is not dominated and the Halmos-Savage theorem on suf-
ficiency does not hold. Nevertheless, as shown in [2], minimal sufficient σ-fields exist
and the Neyman factorization theorem holds good. These results were extended to
pairwise sufficient σ-fields, and a condition for the existence of a minimal pairwise
sufficient σ-field was found in [-IT].

Another basic question is whether a fixed-dimensional sufficient statistic inde- 
pendent of sample size actually exists. In exponential families, it is well-known 
that fixed-dimensional sufficient statistics exist. Outside of exponential families, 
however, sufficient statistics are hard to find. Some distinguished nonregular cases
like U(-θ, θ) provide additional examples. In [-54], it is shown that if the support
(a(θ), b(θ)) is shrinking or expanding, as in the support of U(0, θ) for example, then
the density must be of the form g(θ)h(x) to have a real-valued sufficient statistic.
If a(θ) and b(θ) are both increasing, or both decreasing, as in U(θ, θ + 1), then an
ℝ²-valued sufficient statistic can exist only in special cases.

4. Asymptotics 

The asymptotic point of view undergirded Ghosh's thinking, even in problems that
were not primarily focussed on asymptotic properties. In a sense, much of his work 
on sequential analysis, Bayesian analysis and Bayesian nonparametrics is also, at
least implicitly, work on asymptotics. In fact, many of the most important asymp- 
totic ideas, such as higher order asymptotics and Edgeworth expansions, were pio- 
neered by him. Moreover, in terms of how his thinking progressed, asymptotics can 
be regarded as the next natural conceptual step after thinking about data points 
in sequential analysis, and sufficiency or invariance as a data summarization tech- 
nique. That is, once we have gathered and summarized our data, we want to see 
where it seems to be leading us. 

Ghosh's work on asymptotics can be broadly grouped into seven categories. He 
worked on the Bahadur-Ghosh-Kiefer representation for a quantile. He made foun- 
dational contributions to establishing the existence of Edgeworth expansions. In 
higher order asymptotics, Ghosh examined second order efficiency, Bartlett correc- 
tion and contributed to our understanding of how Wald, Rao and likelihood ratio 
tests compare. Then he turned his attention to Bahadur efficiency and the vexing 
Neyman-Scott problem. 
4.1. Bahadur-Ghosh-Kiefer representation

Bahadur represented a sample quantile approximately as an average of i.i.d. random 
variables. To get this representation, Bahadur assumed the existence of two deriva- 
tives of the c.d.f.; the second derivative is bounded and the first derivative is positive 
on a neighborhood of the p-th population quantile ξ_p. Then Bahadur showed that
the error in the representation is O(n^{-3/4}(log n)^{3/4}). The order of error in Bahadur's
representation is nearly sharp, cf. the exact order n^{-3/4}(log n)^{1/2}(log log n)^{1/4}
obtained by Kiefer.

One of the reasons this is important is that the order of error is small enough 
to obtain asymptotic normality for the sample quantiles. However, assuming the 
existence of two derivatives is somewhat strong. For instance, it rules out the lo- 
cation family from the double exponential density. On the other hand, for most 
statistical purposes, where only the asymptotic distribution is important, having 
an error term of order o_p(n^{-1/2}) is enough. Therefore it is of interest to weaken
Bahadur's assumptions at the expense of weakening the conclusion to o_p(n^{-1/2}).
This is possible, even for a variable point p_n depending on n, as shown in [3n], using
only the assumption of a positive first derivative at ξ_p.
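
The representation can be checked numerically. The sketch below computes the remainder R_n = (sample quantile) - [ξ_p + (p - F_n(ξ_p))/f(ξ_p)], where F_n is the empirical distribution function and f is the density. The function name, the choice of the order statistic of rank ceil(np) as the sample quantile, and the uniform example are all assumptions made for illustration.

```python
import math
import random

def bahadur_remainder(sample, p, xi_p, density_at_xi):
    """Remainder of the Bahadur representation for the sample p-th quantile.

    xi_p and density_at_xi are the true quantile and the true density
    there; they are known here only because the example is simulated.
    """
    n = len(sample)
    ordered = sorted(sample)
    sample_quantile = ordered[math.ceil(n * p) - 1]
    ecdf_at_xi = sum(x <= xi_p for x in sample) / n
    # Linear (i.i.d. average) part of the representation.
    linear_part = xi_p + (p - ecdf_at_xi) / density_at_xi
    return sample_quantile - linear_part

# Median of U(0, 1): xi_p = 0.5 and the density there is 1.
rng = random.Random(0)
uniform_sample = [rng.random() for _ in range(10000)]
remainder = bahadur_remainder(uniform_sample, 0.5, 0.5, 1.0)
```

For n = 10000 the remainder is expected to be far smaller than n^{-1/2} = 0.01, which is the scale of the fluctuations of the sample quantile itself.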

Actually, the idea of representing a quantile approximately as an average of i.i.d. 
observations occurred to Ghosh independently in the mid sixties at the same time 
as Bahadur was working on the problem. Ghosh was looking at the problem in the 
more general multivariate multisample framework in connection with asymptotic 
normality of multivariate rank tests. He did not record his proof then since it did 
not extend to the multivariate setup at that time. 

4.2. Edgeworth expansions 

Edgeworth expansions are natural refinements of asymptotic normality results in
that they give error terms of asymptotically smaller order by including more terms 
in addition to the leading normal term. However, for a long time, Edgeworth expan- 
sions were only heuristically justified. In the pioneering paper [7], it is shown that 
under conditions of finiteness of certain moments and a condition on characteristic 
function known as Cramér's condition in the literature, the r-th order Edgeworth
expansion of a smooth function of sample averages admits error O(n^{-(r+1)/2}). In
particular, it follows that for the sample average, finiteness of the 2r-th moments 
is required to justify an Edgeworth expansion of order r. The follow-up paper [8] 
relaxes some moment conditions. A thorough and lucid treatment of Edgeworth 
expansions and higher order asymptotics is given in Ghosh's IMS monograph [.'i2]. 

Another angle on Edgeworth expansions comes from Fisher consistency. Con-
sider an exponential family with density proportional to exp[Σ_{j=1}^k w_j(θ) t_j(x)]. In
this context, an estimator T_n, which is a function of the k-dimensional sufficient
statistic (Σ_{i=1}^n t_1(X_i), ..., Σ_{i=1}^n t_k(X_i)), is Fisher consistent if T_n(w(θ)) = θ. As-
suming sufficient smoothness conditions and linear independence of the component
functions w_1(θ), ..., w_k(θ), a Fisher consistent estimator can be written as a smooth
function of sample averages, and hence has an Edgeworth expansion. In [62], this
Edgeworth expansion is compared with that of the MLE, which is another Fisher 
consistent estimator. Interestingly, for any bowl shaped loss function, the MLE has 
better second order risk properties than any other Fisher consistent estimator. Con- 
sequently, this gives a way to discriminate among estimators which are first order 
asymptotically equivalent. This property is called second order efficiency and will 
be discussed in the next subsection. 
Edgeworth-type expansions need not be restricted to asymptotically normal esti-
mators. Other limiting distributions can appear naturally. Recall that log-likelihood
ratio type statistics are among the most common statistics converging to non- 
normal limits such as a chi-square distribution. For locally quadratic functions of 
sample averages, such as the log-likelihood ratio, asymptotic expansions have been 
obtained in [V->]. They have a leading chi-square term. Subsequent terms appear 
as coefficients of powers of n^{-1} and are finite linear combinations of chi-square
distributions with degrees of freedom p, p + 2, p + 4, etc., where p is the degree of
freedom of the leading term. Similar expansions hold even under contiguous alterna-
tives with non-central chi-squares replacing the chi-square leading term as shown in
[14]. The subsequent terms are finite linear combinations of non-central chi-square
distributions with degrees of freedom p, p + 2, p + 4, and so forth.

4.3. Second order efficiency

Second order efficiency (also called third order efficiency by some authors) is the 
natural way to compare two asymptotically efficient estimators since they are first 
order equivalent. In particular, it was widely believed that the MLE, or some suit- 
able variant of it, had, asymptotically, the smallest possible risk up to the second 
order. Ghosh, among others like Efron, Chibisov, Pfanzagl, Akahira and Takeuchi, 
made pioneering contributions towards rigorous justification of this assertion in 
[64]. His main result may be roughly described as follows: Let T_n be an efficient
estimator and consider a modification T_n^* = T_n + m(T_n)/n. Then T_n^* can be beaten
by θ̂_n^* = θ̂_n + g(θ̂_n)/n, a modification of the MLE θ̂_n where the function g depends
on T_n and m. Here, by a better estimator, we mean that

lim_{n→∞} n^2 [E_θ{W(T_n^*, θ)} - E_θ{W(θ̂_n^*, θ)}] ≥ 0,

for all θ ∈ Θ, for a truncated squared error loss W. This paper also contains other
impressive results such as Bhattacharya-type bounds, a Bayesian connection with 
second order efficiency and a notion of second order asymptotic sufficiency. Similar 
results about second order efficiency of the MLE for Pitman closeness and any 
bounded bowl shaped loss functions are given in [63].

In addition to second order efficiency, there is a notion of second order admis- 
sibility. An estimator is second order admissible if there is no estimator which has 
uniformly smaller second order risk with strict inequality for at least one point. In 
[59], for estimators of the form θ̂_n + g(θ̂_n)/n, a necessary and sufficient condition
for second order admissibility under squared error loss is obtained. 

These second order optimality properties of modified versions of the MLE raise 
the issue whether the MLE has optimality properties beyond the second order. 
A nice counterexample in [60], however, settles this in the negative. On the other hand,
questions on second order admissibility go beyond the MLE to any BAN estima-
tor θ̂_n modified to θ̂_n + g(θ̂_n) + O_p(n^{-1}). The condition for second order Pitman
admissibility is obtained in [58], and its multiparameter version in [49]. 

Another natural question in the context of second order admissibility is the fol- 
lowing. If two or more statistics are separately second order admissible, for two 
different components of a parameter with bias o(n^{-1}), then, is it true that they are
jointly second order admissible? The question has a curious answer given in [16]. 
For two dimensions, they are jointly second order admissible, but for three or more 
dimensions, they are not jointly second order admissible. This result is reminiscent 
of Stein's phenomenon on ordinary admissibility with respect to the squared error
loss for estimating the normal mean. Intuitively, asymptotically, all regular experi- 
ments are normal experiments and thus a phenomenon under normality continues 
to hold asymptotically under any regular model. The interesting part of the result 
is that the phenomenon shows up in the second order. 

4.4. Bartlett correction

Bartlett introduced a remarkable technique, which bears his name, to improve the
chi-square approximation to the distribution of a log-likelihood ratio statistic. The 
idea is embarrassingly simple: rescale the chi-square distribution with the second 
order expansion of the mean of the statistic. It is surprising that such a simple 
strategy improves the approximation so much. 
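
A small worked example may help fix ideas. It is not taken from the papers under review: we assume an i.i.d. sample of size n from the exponential distribution with mean θ and test θ = 1. The log-likelihood ratio statistic is W = 2n(x̄ - 1 - log x̄), and because n x̄ has a Gamma(n, 1) distribution under the null, its mean is exactly 2n(log n - ψ(n)) = 1 + 1/(6n) + O(n^{-3}), where ψ is the digamma function. The Bartlett factor in this example is therefore 1 + 1/(6n). The function names below are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649015329

def digamma_int(n):
    """Digamma function at a positive integer: psi(n) = -gamma + sum_{k<n} 1/k."""
    return -EULER_GAMMA + sum(1.0 / k for k in range(1, n))

def lr_null_mean_exponential(n):
    """Exact null mean of W = 2n(xbar - 1 - log xbar) for n Exp(1) observations.

    Uses E[xbar] = 1 and E[log xbar] = psi(n) - log(n).
    """
    return 2.0 * n * (math.log(n) - digamma_int(n))

def bartlett_factor_exponential(n):
    """Second order expansion of the null mean: 1 + 1/(6n)."""
    return 1.0 + 1.0 / (6.0 * n)

n = 10
exact_mean = lr_null_mean_exponential(n)
# Mean of the rescaled statistic W / (1 + 1/(6n)); the chi-square limit has mean 1.
corrected_mean = exact_mean / bartlett_factor_exponential(n)
```

Even at n = 10 the uncorrected mean exceeds 1 by about 0.017, while the mean of the rescaled statistic differs from 1 only in the fifth decimal place.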

In the seminal paper [9], a variant on Wilks' theorem tuned to the goal of un- 
derstanding the Bartlett correction was presented. Recall that Wilks' theorem is 
the statement that the log-likelihood ratio is asymptotically chi-square. However, 
the chi-square is the result of squaring normals. To see how this might apply to 
the log-likelihood ratio statistic, let X_1, X_2, ... be i.i.d. observations from a para-
metric family governed by θ = (θ^1, ..., θ^p) and let L(θ) be the log likelihood. For
j = 1, ..., p, let θ̂_j be the MLE of θ under the null hypothesis θ^1 = θ_0^1, ..., θ^j = θ_0^j,
and let T_j = [2n{L(θ̂_{j-1}) - L(θ̂_j)}]^{1/2} sign(θ̂_{j-1}^j - θ_0^j), where θ̂_0 stands for the un-
restricted MLE. Note that squaring T_j gives the usual object in Wilks' theorem
with limiting chi-square behavior. However, now, without the square, (T_1, ..., T_p)
is asymptotically normal with error O_p(n^{-3/2}) under the grand null hypothesis.
This property of Tj gives rise to the Bartlett correction in the multidimensional 
setting. 

Another result developed in that paper is a Bayesian version of the Bartlett 
correction. This is a Bartlett correction to the posterior distribution, conditional 
on the data, obtained by letting the prior tend to the degenerate distribution at the 
true parameter value. The relation between the Bartlett correction and the Bayesian
correction gives a deeper understanding of the Bartlett correction phenomenon and 
leads to a variety of generalizations. 

Following this path, [41] studied the asymptotic equivalence of the frequentist 
and Bayesian Bartlett corrections for the likelihood ratio and the conditional like- 
lihood ratio statistic (CLR) introduced by Cox and Reid. In particular, the con- 
ditions for equivalence are instrumental for giving a simple proof of the existence 
of the frequentist Bartlett correction for the CLR statistic. This was extended to 
the multivariate case in [40]. A variant on the likelihood ratio called the adjusted
likelihood ratio (ALR) was introduced by McCullagh and Tibshirani. In [45], it was
shown that the ALR statistic has behavior similar to that of the CLR statistic, in
that it admits a Bartlett correction and its power under contiguous alternatives is
equivalent to that of the CLR up to the order O(n^{-1/2}). In terms of average power,
the agreement continues up to o(n^{-1}).

4.5. Comparison of the likelihood ratio, Wald's and Rao's statistics

The problem of comparing the likelihood ratio (LR), Wald's and Rao's tests, with 
regard to power has received significant attention in the statistics and econometrics 
literature. It is well-known that, up to the first order of approximation and under 
contiguous alternatives, these three tests have the same local power as dictated 
by the noncentral chi-square distribution. Discrimination among them, therefore,
calls for comparison via higher order power. However, while the LR test is locally
unbiased up to a higher order of approximation, the same does not hold in general 
for the other two tests. From this perspective, to make them really comparable, 
Ghosh suggested considering locally unbiased versions of Wald's and Rao's tests. 
This work, done under his supervision, eventually led to an optimum property of 
Rao's test in terms of third order local power. A review of these developments is 
available in [■) ] ]. 

In addition, the power properties of the three tests, as well as their Bartlett
adjustability, when they are developed on the basis of a quasi-likelihood rather
than a true density-based likelihood, were discussed in [48].

4.6. Bahadur-Cochran deficiency

To compare the asymptotic performance of two tests, one may look at their 
Bahadur-Cochran relative efficiency, which is the limit, as δ → 0, of the ratio of the
smallest integers which make their levels less than δ. For many pairs of reasonable
tests, the ratio turns out to be 1. To compare them at a finer level, it is sensible to 
look at their difference, which may be called the Bahadur-Cochran deficiency. The 
limit inferior (or superior) of the difference, reflecting the relative advantage of one 
test over the other, was calculated in [12] for some common test statistics.

4.7. Neyman-Scott problem and semiparametric inference

Ghosh also made notable contributions to the Neyman-Scott problem. In the
Neyman-Scott problem, a new observation X_i is governed by a common parameter
θ and an additional parameter ξ_i, depending on i, but only the parameter θ is of in-
terest. The problem is notoriously difficult in that common estimators, such as the
MLE, are usually inconsistent. For instance, if the X_ij are independently normally
distributed with mean μ_i and variance σ^2, i = 1, ..., n, j = 1, ..., k, then the MLE
of σ^2, (nk)^{-1} Σ_{i=1}^n Σ_{j=1}^k (X_ij - X̄_i)^2, where X̄_i = k^{-1} Σ_{j=1}^k X_ij, converges to the
wrong value (1 - k^{-1})σ^2. Although it is easy to correct the MLE in this particular
situation, in general identifying the correction is a hard problem. 
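
The inconsistency in the normal example is easy to see in a simulation. The sketch below is illustrative only: the function names, the seed, and the way the group means are generated (uniformly on an arbitrary interval) are assumptions. With k = 2 observations per group and σ^2 = 1, the MLE settles near 1 - 1/2 = 0.5 rather than 1.

```python
import random

def neyman_scott_mle(groups):
    """MLE of the common variance: pooled within-group sum of squares / (n k)."""
    n = len(groups)
    k = len(groups[0])
    total = 0.0
    for group in groups:
        group_mean = sum(group) / k
        total += sum((x - group_mean) ** 2 for x in group)
    return total / (n * k)

def simulate_groups(n, k, sigma, seed=0):
    """n groups of k normal observations, each group with its own mean."""
    rng = random.Random(seed)
    groups = []
    for _ in range(n):
        mu_i = rng.uniform(-10.0, 10.0)  # arbitrary nuisance mean for this group
        groups.append([rng.gauss(mu_i, sigma) for _ in range(k)])
    return groups

k = 2
mle = neyman_scott_mle(simulate_groups(5000, k, 1.0))
corrected = mle * k / (k - 1)  # the easy correction available in this example
```

Increasing the number of groups n does not help: every new group brings a new nuisance mean, so the bias factor (1 - 1/k) persists however large n becomes.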

Ghosh proposed constructing an asymptotically efficient estimator for θ by view-
ing the ξ_i as random variables arising from an unknown distribution G. The semi-
parametric model resulting from this can then be explored to find efficient esti-
mators for θ. In addition, efficiency in the Neyman-Scott model can be defined in
terms of the semiparametric model so the two models have many interesting links 
between them. These links may be exploited and are studied in detail in [4], [5] and
[6]. 

5. Bayesian inference 

Ghosh has always been very fond of Bayesian ideas and, later in his career, he 
became more convinced that the Bayesian approach to statistics is more natural 
and fruitful. Over the course of his investigations, Ghosh examined each aspect 
of the Bayesian formulation, from construction of a prior to model selection, to 
asymptotic properties. And again, this can be seen as a continuation from his 
asymptotic work. After all, once the asymptotics are developed, we want to see 
how they can be used in a complete inference problem and the Bayesian setting 
provides a unified context. Indeed, Ghosh's contributions have helped speed the
development of several branches of Bayesian analysis because of his asymptotic 
orientation. 

Ghosh had always been pragmatic and thought that a good statistical method 
should have good frequentist properties as well as sensible conditional properties.
Moreover, as in the frequentist case, asymptotics often play a vital role in Bayesian
inference and one of the recurring themes in Ghosh's work has been the quest for
frequentist properties of posterior distributions. As one of the leaders in developing
objective Bayesian methods, he regularly worked to reconcile the two schools of 
thought. The paper [57] elaborately reviews issues and developments in objective 
Bayesian methodology. 

Ghosh's Bayesian work can be broadly grouped into four categories. He has 
worked on frequentist matching and other objective priors. He has worked on de-
termining the limiting behavior of posterior distributions in the parametric context.
Then, he has turned his attention to richer model classes, examining Bayesian non-
parametrics and model selection.

5.1. Matching and other objective priors 

Ghosh had never been very keen on the term noninformative to describe priors 
that are constructed through some automatic mechanism rather than through a 
subjective assessment of odds. His preference was to use these priors as objective 
or default priors in the absence of genuine subjective information. To him, such 
priors can be obtained by any one of various techniques including matching what 
a frequentist might use, invariance, entropy-type maximizations (reference priors),
approximation, or anything that seemed reasonable. 

The idea of a Bayesian choosing a prior so as to match frequentist inferences
was originally introduced by Peers and Welch, but the term "probability matching
prior" was first used by Ghosh and Mukerjee [42] and the approach became popular
after Ghosh's presentation in the 1990 Valencia meeting. The basic idea is quite
simple: choose a prior so that Bayesian notions like credibility approximately agree
with the corresponding frequentist notions like confidence level. However, when
asymptotic normality of the posterior distribution holds (discussed in the next
subsection), it means that the variability according to the posterior distribution of
a parameter is asymptotically equivalent to the sampling fluctuations of the MLE
in the frequentist sense. This implies that Bayes-frequentist matching occurs for
any prior under minimal restrictions. Consequently, to identify a prior uniquely, 
first order matching of limits is not enough. Satisfyingly, agreement continues to 
the next order, but only if the prior is of a certain form. Thus matching can be used 
to characterize a prior, which may then be thought of as objective at least in the 
sense that it was not chosen according to the personal views of the experimenter. 

Of course, neither Bayesian credibility sets nor frequentist confidence sets are
unique, so when θ is a scalar, it is natural to look at one sided intervals. If W
is a properly centered and normalized version of the parameter, then equating
the posterior probability P(W < t | X_1, ..., X_n) with the frequentist probability
P_θ(W < t) for t and ensuring both sides are 1 - α up to o(n^{-1/2}) for each α leads
to a differential equation. The Jeffreys prior is the solution to this equation.

A multiparameter version of this frequentist-Bayesian matching was used in [43].
In higher dimensions, the components of W may be defined by successively comput- 
ing the regression residual of the current component over the earlier components. 
Naturally, this depends on the ordering of the parameters, but the dependence is
not present when the parameters are orthogonal. In these cases, the matching cri- 
terion leads to partial differential equations. Curiously, Jeffreys' prior is a solution 
in some but not all cases: the location-scale problem is an important exception. In 
fact, it is well known that Jeffreys' prior, which is also the left Haar measure, may 
be an inappropriate choice in this case, so the matching criterion genuinely leads 
to sensible solutions even in high dimensional cases. 

The matching prior is closely related to other important objective priors such 
as the reference prior. Reference priors often depend on the role of the parameter; 
nuisance parameters are treated differently from parameters of interest. Interest-
ingly, in the two parameter case, a cute observation of Ghosh is that the reverse
reference prior, rather than the reference prior itself, is probability matching. Here, 
by reverse reference prior, we mean the reference prior computed by reversing the 
roles of the parameter of interest and the nuisance parameter. More details and 
discussion of other properties, such as weak minimaxity, may be found in [42]. 

Although matching posterior probabilities does yield useful insight, highest pos- 
terior density (HPD) regions are more efficient credible sets from a Bayesian stand- 
point. Accordingly, matching the coverage probability of HPD regions with the cred- 
ibility is an alternative that might be more appealing to some Bayesians. When this 
matching is done to o(n^{-1}), it leads to differential equations characterizing prior
distributions. These were derived in [44]. In some cases, Jeffreys' prior solves these 
equations and so is a matching prior in the sense of coverage probability as well. 
A related paper is [46]. Matching the coverage of one-sided posterior credibility 
intervals for parametric functions up to O(n^{-1}) was studied in [17].

Alternatively, instead of characterizing a prior through matching, one might ask 
if there is some adjustment to make matching work for any prior satisfying mild 
general conditions. Indeed, in [47], it is shown, with examples, that if the center 
of the (1 - α)-HPD ellipsoid is appropriately shifted by an o(n^{-1/2}) amount, where
the correction is obtained by solving an equation depending on the prior, then the
resulting perturbed HPD ellipsoid's coverage is 1 - α + o(n^{-1}).

Of course, there are many sensible notions of objectivity for a prior other than 
matching. Invariance is often the driving force in group models, where a group of 
transformations is acting on the parameter space and the parameter of interest is the
maximal invariant parametric function. In [] S], a detailed study of various priors 
such as the Chang-Eaves prior for group models is given in the light of matching 
and the marginalization paradox. 

5.2. Limits of posterior distributions

One of the most intriguing results in statistics is the Bernstein-von Mises theorem, 
which states that the posterior distribution of the parameter centered at the MLE 
and scaled by $\sqrt{n}$ times the square root of the Fisher information converges to
the standard normal distribution almost surely, as the sample size increases to
infinity. This parallels the frequentist result that $\sqrt{n}(\hat\theta - \theta_{\mathrm{true}})$ is asymptotically
normal with variance given by the inverse Fisher information. In essence, posterior 
normality implies that in an asymptotic sense, at least to first order, any sensible 
Bayesian must agree eventually with frequentist notions of variability. 
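
The first-order agreement described above can be checked numerically in a simple conjugate model. The sketch below is only an illustration under stated assumptions: a Bernoulli model with a Beta(a, b) prior, a fixed observed success proportion, and a hypothetical helper `posterior_normal_gap` that measures the largest difference between the posterior distribution function and that of a normal law centered at the MLE with variance equal to the inverse Fisher information divided by n.

```python
import math


def posterior_normal_gap(n, successes, a=1.0, b=1.0, cells=20000):
    """Largest CDF gap between the Beta posterior and N(mle, mle*(1-mle)/n).

    Requires 0 < successes < n so that the normal approximation has positive
    variance. The posterior CDF is computed by a midpoint rule on (0, 1).
    """
    mle = successes / n
    sd = math.sqrt(mle * (1.0 - mle) / n)
    h = 1.0 / cells
    mids = [(i + 0.5) * h for i in range(cells)]
    # Unnormalized log posterior density of Beta(a + successes, b + n - successes).
    log_post = [
        (a + successes - 1.0) * math.log(p)
        + (b + n - successes - 1.0) * math.log(1.0 - p)
        for p in mids
    ]
    top = max(log_post)
    weights = [math.exp(v - top) for v in log_post]
    total = sum(weights)
    cdf, gap = 0.0, 0.0
    for i, w in enumerate(weights):
        cdf += w / total
        edge = (i + 1) * h  # right end of the i-th cell
        normal_cdf = 0.5 * (1.0 + math.erf((edge - mle) / (sd * math.sqrt(2.0))))
        gap = max(gap, abs(cdf - normal_cdf))
    return gap


if __name__ == "__main__":
    for n in (20, 200, 2000):
        print(n, posterior_normal_gap(n, successes=3 * n // 10))
```

The gap shrinks as n grows while the observed proportion is held fixed, which is the qualitative content of the theorem in this toy case. The sketch does not address the almost sure mode of convergence.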

Ghosh worked to extend posterior normality in a variety of directions. One natu- 
ral idea is to look at higher order properties so that the usual normal limit is viewed 
as merely the first term in an asymptotic expansion. This parallels the sense in
which an Edgeworth expansion is an improvement over the standard central limit
theorem. Johnson pioneered such expansions, but the probability statements in his
expansions are in terms of the true distribution of the sample. Often, a Bayesian
is more interested in bounds that are uniform on sets with high probability in the
marginal distribution of the sample. In [fil], precise conditions were given so that
Johnson's expansion of the posterior distribution holds on a set with marginal prob-
ability $1 - O(n^{-r})$, where r is the extra number of terms in the expansion, i.e., not
counting the leading normal. It was also shown, by counterexamples, that some of 
the earlier published results in the field are incorrect. 

Sometimes it is meaningful to condition on a statistic rather than the full data to 
obtain the posterior distribution. In particular, since the sample mean is a widely 
used summary measure, it is natural to ask if a version of the Bernstein-von Mises 
theorem holds when the posterior is computed given only the mean. Provided that 
expectation and variance are smooth functions, and the eigenvalues of the covari- 
ance matrix are uniformly bounded and bounded away from zero, it is shown in 
[I j] that a normal limit for the posterior distribution is obtained. The variance of 
the limiting distribution can equal the variance of an observation, but in general, 
the normal limit can differ from that in the usual Bernstein-von Mises theorem, 
unless the sample mean is asymptotically sufficient. The proof is based on an Edge- 
worth expansion for the sample mean and a local limit theorem. The idea extends 
to independent but non-identically distributed observations. 

More broadly, the Bernstein-von Mises phenomenon in a parametric family may 
be seen as the convergence of the posterior density of the standardized parameter 
to a non-degenerate distribution. In general, the centering need not be at the MLE, 
the scaling need not be $\sqrt{n}$ and the limit distribution need not be normal. Indeed,
in some nonregular families such as the uniform distribution on $[0, \theta]$ or the location
family of the exponential distribution, centering by the Bayes estimator and scaling 
by n yields an exponential limit. This leads to the following question: For which 
families will a limit of the posterior distribution exist? When it does exist, what is 
the correct centering, scaling and limiting distribution? This problem is germane 
to approximating posterior distributions numerically when n is large. 
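
The exponential limit in the uniform family can be seen in closed form under a convenient prior. The sketch below makes assumptions that are not in the text: the prior is taken proportional to $1/\theta$, so that the posterior density is proportional to $\theta^{-(n+1)}$ on $\theta \ge M$ with $M$ the sample maximum, and the centering is at $M$ rather than at the Bayes estimator, with the scaled quantity divided by $M$ so that the limit is the standard exponential. The function name `scaled_posterior_survival` is hypothetical.

```python
import math


def scaled_posterior_survival(t, n):
    """P(n * (theta - M) / M > t | data) under the assumed posterior.

    With posterior density proportional to theta^-(n+1) on theta >= M,
    P(theta > c | data) = (M / c)^n, so taking c = M * (1 + t / n) gives
    (1 + t / n)^(-n), which does not depend on M.
    """
    return math.exp(-n * math.log1p(t / n))


if __name__ == "__main__":
    for n in (10, 1000, 1000000):
        print(n, scaled_posterior_survival(1.0, n), math.exp(-1.0))
```

As n increases the survival function approaches $e^{-t}$, so the scaling is by n rather than $\sqrt{n}$ and the limit is exponential rather than normal.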

Under the general setup of a parametric family considered by Ibragimov
and Has'minskii in their book, a very elegant characterization was given in [35] and
[23] in terms of the behavior of the limiting (local) likelihood ratio process of the
model, $Z_n(u) = p(X^n; \theta + r_n u)/p(X^n; \theta)$, where $X^n$ is the observation at stage n
and $r_n$ is the appropriate normalizer for the problem. Usually $X^n = (X_1, \ldots, X_n)$
and n is the sample size. Let $Z(u)$ stand for the weak limit of $Z_n(u)$ and
$\xi(u) = Z(u) / \int Z(v)\,dv$, a random probability density. Under the natural scaling in the
family, the posterior distribution converges to a limit, after appropriate centering,
if and only if $\xi(u) = g(u + W)$ for some fixed probability density g and a random
variable W, i.e., as a random element in $L_1$, $\xi(\cdot)$ is a random location shift of a
fixed probability density g. When this holds, g is the limit of the posterior density.
Clearly, this is a stringent representation, so in many nonregular cases the posterior 
distribution will not have a limit. 

Interestingly, in the regular cases, local asymptotic normality implies that $\xi(u) =
g(u + W)$, in which g is normal and W is a random normal shift. Thus this yields a
Bernstein-von Mises theorem under an extremely general condition. A similar limit
theorem holds with an exponential limit whenever densities are positively supported
on an interval $[a(\theta), b(\theta)]$, where the support is either expanding or contracting.

While it is disappointing to find that posterior limits exist only in relatively rare 
cases, it does not rule out the possibility of finding useful approximations to the
posterior distributions depending on the sample size n. For change-point problems,
where the density jumps from a positive value to another positive value at an 
unknown location but is otherwise smooth, a useful approximation was obtained in 
[24] by normalizing an approximation to the likelihood ratio process. It turns out 
that a certain mixture of n many truncated and shifted exponential densities is a 
good approximation. 

5.3. Bayesian nonparametrics 

Ghosh's involvement with Bayesian nonparametrics started in the mid 90's with the 
paper [51] attempting to determine whether the priors used for survival analysis lead 
to consistency under censoring. This paper showed that for the Dirichlet process, 
the posterior under censoring can be represented as a Polya tree process whose par- 
tition depends on the data, and then consistency can be obtained from the tail-free 
property of Polya tree processes. The question is followed up in subsequent papers 
[53] and [36]. Since then, Ghosh has continued to be one of the most important 
contributors to understanding the asymptotics of Bayesian nonparametrics. 

For instance, the search for a noninformative prior for infinite dimensional mod- 
els, as an extension to the finite dimensional case, is ongoing. One approach is 
to generalize the notion of a uniform distribution. This was proposed in [19] us- 
ing uniform distributions on discrete approximations to a space found by maximal 
$\epsilon$-dispersed sets. Even in the parametric setting this approach is fundamental and
leads to Jeffreys' prior. The approach gives consistency in infinite-dimensional cases. 

More typically, Ghosh was strongly motivated by the examples of inconsistency 
of posterior distributions in infinite dimensional models. While he appreciated those 
illuminating examples, he was always hopeful that Bayes' methods would work if 
priors were constructed properly. He was particularly fond of the Kullback-Leibler
property which requires that the true distribution be in the support of the prior
when distances are measured in terms of Kullback-Leibler divergence. That is, the
prior should assign strictly positive probability to every Kullback-Leibler neighbor-
hood around the true distribution.

Because of this, Ghosh thought the Dirichlet process was inappropriate in many 
contexts, despite its evident utility, since its discreteness means it fails to have 
anything in its Kullback-Leibler support. In [20], it was shown that a prior with
the Kullback-Leibler property, such as a suitable Polya tree or a Dirichlet mixture
process, can overcome the inconsistency property of Dirichlet processes for esti- 
mating a location parameter. Essentially the same phenomenon appears in linear 
regression models as shown in [1]. In that paper, the first extension of a general 
posterior consistency theorem to independent non-identically distributed variables
is also developed. 

In Bayesian nonparametrics, consistency often combines testing concepts with
sieves. A celebrated result of Schwartz emphasizes the role of tests for the true
density $f_0$ versus the complement $V^c$ of a neighborhood, say $V$, around it. The basic
idea is to construct tests by covering $V^c$ with many small balls, say $B_i$'s, and
testing $f_0$ versus $B_i$ for each i using powerful tests. One can then simply look at
the maximum of all tests against each small ball, whose type II error probability
is clearly under control and whose type I error probability is bounded by the common
exponential bound for error probability multiplied by the number of small balls
required to cover $V^c$. Thus the concept of metric entropy, which is the logarithm
of the number of balls required to cover a set, comes into the picture. Generally
$V^c$ is not compact, and it is not possible to cover it by finitely many small balls.
The difficulty can be overcome by using a sieve, which is a sequence of increasing 
subsets of a parameter space that gradually fill out the whole parameter space. One 
may ignore the portion of the parameter space outside the sieve as long as that part 
has exponentially small prior probability. Now one must control the metric entropy 
of the sieve to ensure that it does not grow faster than a small multiple of n. This 
style of proof gives consistency for density estimation with Dirichlet mixtures of 
normal kernels as shown in [21], providing a large sample justification for the most 
widely used Bayesian density estimator. 
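
The role of entropy in this argument can be made explicit with a short calculation. The display below is a sketch under assumed notation that is not in the text: $f_0$ is the true density, $B_1, \ldots, B_N$ are the small balls covering the relevant part of the complement of the neighborhood $V$, $\phi_i$ is a test of $f_0$ versus $B_i$, and $c > 0$ is a common constant such that both error probabilities of each $\phi_i$ are at most $e^{-cn}$.

```latex
\[
  \phi = \max_{1 \le i \le N} \phi_i, \qquad
  E_{f_0}\phi \;\le\; \sum_{i=1}^{N} E_{f_0}\phi_i \;\le\; N e^{-cn}
  \;=\; \exp(\log N - cn),
\]
\[
  \sup_{f \in \bigcup_i B_i} E_f(1 - \phi)
  \;\le\; \max_{1 \le i \le N} \, \sup_{f \in B_i} E_f(1 - \phi_i)
  \;\le\; e^{-cn}.
\]
```

The type I error bound is exponentially small as long as $\log N \le \delta n$ for some $\delta < c$, which is why the metric entropy of the sieve is only allowed to grow like a small multiple of n.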

This approach works for density estimation with other priors in place of the 
Dirichlet mixtures. In [(!>>], consistency is obtained for the logistic Gaussian prior 
for a density, that is, a prior on densities obtained by exponentiating and then 
normalizing a Gaussian process. 

The importance of entropy for posterior consistency appeared in [21]. There 
it is seen that in the nonparametric setting prior positivity at the true density 
must be satisfied, but in terms of special neighborhoods given by the Kullback-
Leibler number. Moreover, it must be possible to choose a sieve whose entropy
grows no faster than the rate $O(n)$, while ensuring that the prior probability of the
complement of the sieve is exponentially small as n increases. This observation led to
the derivation of the results on posterior convergence rates in [2")] in the sense that 
the conditions for rates can be viewed as quantitative analogs of the conditions for 
consistency. 

For instance, instead of just requiring that the prior for a fixed $\epsilon$ neighborhood
in the Kullback-Leibler sense has positive probability, one now needs to show that
the prior probability of the Kullback-Leibler neighborhood of radius $\epsilon_n$ is at least
$e^{-cn\epsilon_n^2}$, where $\epsilon_n$ is the intended rate of posterior convergence. In a similar manner,
requiring that the $\epsilon_n$-entropy of the sieve be bounded by a multiple of $n\epsilon_n^2$ is also
reminiscent of the condition that the $\epsilon$-entropy of the sieve should be bounded by a
small multiple of n. Thus, for fixed $\epsilon_n$, this reduces to the condition for consistency.

The paper also constructs a prior achieving optimal rates of convergence by 
bracketing densities above and below by two functions - choosing a finite collection 
of brackets to provide upper and lower bounds for any probability density in the 
given class. This ensures good approximation of any function within the bracket 
together with a control over likelihood ratios. This can be viewed as a refinement 
of the construction proposed in [] ')]. Other approaches to optimal rates are also 
discussed, most notably, through exponential families generated using a B-spline 
basis. 

Many aspects of Bayesian asymptotics for infinite dimensional models are neatly 
summarized in the review [22], and thoroughly discussed in [52], which to date is 
the only book dealing with asymptotic results in Bayesian nonparametrics. 

5.4. Model selection and Bayesian hypothesis testing

Testing hypotheses is a major area where frequentist and Bayesian procedures often 
differ substantially. There is a tendency for frequentist methods to over-reject just 
as there is a tendency for Bayes' methods to under-reject, as in the Lindley paradox. 
Results such as the consistency of the Bayesian information criterion (BIC) bridge
the gap somewhat because the BIC approximates Bayes factors and is frequentist
consistent for model selection under appropriate conditions in the sense that the
BIC selects the correct model with probability tending to one.
These properties of the BIC are valid only if the dimension p of the model re-
mains fixed. However, for many applications, especially for complex data containing
numerous variables commonly arising nowadays, the BIC fails to approximate the
Bayes factor adequately and is inconsistent. The main reason for the failure is ignor-
ing certain terms in an expansion of the Bayes factor which are not negligible when
$p \to \infty$. The difficulty can be avoided by paying proper attention to these terms.
In [3], a correction is proposed by introducing two more terms, one proportional
to p and the other to $\log p$, as well as changing the meaning of the sample size to
the number of replications. The resulting "generalized BIC" then selects the correct 
model with increasingly high probability. Another generalization of BIC is devel- 
oped in [11] which works in a general exponential family. These generalizations of 
BIC are powerful tools to overcome the challenges posed by high-dimensional data 
problems of contemporary statistics. 
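
For reference, the classical fixed-dimension BIC that these generalizations start from can be sketched in a few lines. The example is an illustration only and is not the generalized BIC of [3] or [11]: it assumes a toy comparison between a normal model with mean zero (M0) and a normal model with free mean (M1), both with known unit variance, and the function names are hypothetical.

```python
import math
import random


def bic_gaussian_mean(xs, free_mean):
    """Classical BIC, -2 * max log-likelihood + k * log(n), for N(mu, 1) data."""
    n = len(xs)
    mu = sum(xs) / n if free_mean else 0.0
    loglik = -0.5 * n * math.log(2.0 * math.pi) - 0.5 * sum((x - mu) ** 2 for x in xs)
    k = 1 if free_mean else 0  # number of free parameters
    return -2.0 * loglik + k * math.log(n)


def select_model(xs):
    """Return the label of the model with the smaller BIC."""
    return "M1" if bic_gaussian_mean(xs, True) < bic_gaussian_mean(xs, False) else "M0"


def selection_rate(true_mean, n, reps=200, seed=0):
    """Fraction of simulated data sets on which BIC picks the true model."""
    rng = random.Random(seed)
    truth = "M0" if true_mean == 0.0 else "M1"
    hits = 0
    for _ in range(reps):
        xs = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        hits += select_model(xs) == truth
    return hits / reps


if __name__ == "__main__":
    for n in (20, 200, 2000):
        print(n, selection_rate(0.0, n), selection_rate(0.5, n))
```

Here the rule reduces to choosing M1 when n times the squared sample mean exceeds log n, so with the dimension fixed the selection rate approaches one under either model. The sketch says nothing about the regime where p grows with n.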

In model selection problems, the definition of optimality is often tricky. An ap- 
pealing approach is comparison with the oracle, that is, with the best procedure 
(for a given loss function) which uses the knowledge of the correct model in making 
decisions. A parametric empirical Bayes (PEB) approach approximates the Bayes 
factor by deriving the rule in a parametric model but estimating the parameters 
in the penalty function by a penalized likelihood criterion with data dependent 
penalty. In [dii], the relative performance of a PEB, the AIC and the BIC were 
thoroughly studied through asymptotics and simulations under both 0-1 and pre- 
diction loss. The conclusion is that the BIC performs badly, but PEB rules perform 
quite satisfactorily, and so does the AIC. If Bayes estimates are used in making 
predictions, instead of least squares estimators, a PEB performs better than the 
AIC. 

One particular difficulty with the Bayes factor is that it is undefined when im-
proper priors are used in individual models. Various remedies are proposed in the
literature, based on the idea of using a part of the information contained in the 
data (training portion) to make priors proper and use the remaining portion in 
Bayesian analysis with the obtained "proper prior". Since this typically depends on
the ordering of the data, some kind of averaging, through bootstrap or cross valida- 
tion, over different choices of the training portion is desired. A particularly popular 
candidate among these Bayes factors is obtained by taking a geometric average. 
In [G7], such Bayes factors are studied through asymptotics as the proportion of 
the training sample varies, and conditions for consistency are obtained as the total 
sample size goes to infinity. It turns out that the predictive optimality often
claimed for the "geometric Bayes factor" is not entirely correct.

There are many other significant papers on model selection authored by Ghosh. 
In [10], optimality of the AIC in inference about Brownian motion is shown. The 
reviews [5G] and [5")] contain a wealth of information on Bayesian model selection.

6. Concluding remarks 

Overall, Ghosh's work in statistics reveals a progression. He began with individual 
data points, proceeded to data summarization, and then to the asymptotics of 
inference. Ghosh's results there were a successful attempt to map out where the 
accumulation of data tends to point. In a sense, asymptotic limits are the ultimate
data summarization. Then, putting it all together, Ghosh turned to the Bayesian 
formulation, examining each of its components, prior, model, posterior, in turn, to 
permit a comprehensive and unified study of the statistical problem. Indeed, his
recent work on Bayesian nonparametrics is a further generalization, again a logical
step because it builds on his earlier work by using ever richer model classes. 

In fact, Ghosh has worked in many more areas of statistics, apart from those 
outlined above, as well as working on a variety of applications. These topics include 
distribution theory, decision theory, robustness, finite population sampling, reliabil- 
ity, quality control, modeling hydrocarbon discovery, geological mapping and DNA 
fingerprinting. 

Finally, every great researcher has a strategy, a method or a drive, often sum- 
marized in a maxim, that guides or motivates their intellectual endeavors. One of 
Ghosh's maxims was the injunction: "Settle the question!" By this he meant formu- 
late a question so that answering it gives you something definite for the formulation 
of another question. As can be inferred from the progression of his work, his ques- 
tioning led him to an ever broader view of the statistical problem, culminating in 
a Bayesian treatment of high-dimensional models, nonparametric or not. Ghosh's 
injunction to settle questions has helped, and will continue to help, researchers all 
over the world to think deeply about the most important issues.